Transformer Architecture

I learned how transformers actually work, and how today's AI models are based on this architecture.

It was interesting and beautiful.

The input string is turned into numbers in two steps. First the prompt is tokenized, then the tokens are embedded. During embedding, the model looks up one vector for each token and one vector for its position, and adds the vocab embedding and the position embedding together.

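This newly formed matrix is called the residual stream. The transformer architecture is mostly linear (matrix multiplications and additions). The main non-linearity is the activation function in the MLP layer, with smaller ones in the softmax inside attention and in the layernorms.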
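Here is a small sketch of the tokenize, embed, and add steps, just to make the shapes concrete. Everything in it is an assumption for illustration: the five-word vocabulary, the sizes, the random weights, and the whitespace "tokenizer" (real tokenizers usually split text into subword pieces).

```python
import random

random.seed(0)

# Hypothetical toy setup, not taken from any real model.
vocab = ["the", "cat", "sat", "on", "mat"]
d_model = 4          # width of each embedding vector
max_positions = 8    # longest prompt this toy model accepts

token_to_id = {tok: i for i, tok in enumerate(vocab)}


def random_matrix(rows, cols):
    """Stand-in for learned weights: small random numbers."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


W_E = random_matrix(len(vocab), d_model)        # vocab embedding table
W_pos = random_matrix(max_positions, d_model)   # position embedding table


def embed(prompt):
    # Step 1: tokenize (whitespace split is only a placeholder).
    token_ids = [token_to_id[tok] for tok in prompt.split()]
    # Step 2: look up both embeddings and add them element-wise.
    return [
        [e + p for e, p in zip(W_E[tid], W_pos[pos])]
        for pos, tid in enumerate(token_ids)
    ]


residual = embed("the cat sat")
print(len(residual), len(residual[0]))  # 3 4
```

The result has one row per token and d_model columns, and this is the matrix that the transformer blocks go on to read from and add to.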

The residual stream then enters the transformer blocks. Each block starts with a layernorm, which normalizes the incoming matrix before it is passed to the attention heads. Each head picks up on a different feature of the residual stream: it computes a query for every position, compares it against the keys of the other positions to decide how much attention to pay to each, and uses those weights to take a weighted sum of the values. The result is added back to the residual stream. This addition is called a skip connection, and it helps make sure the gradient doesn't vanish. It also preserves the original data, because the attention heads only contribute a small change (a delta) on top of it.

All of this happens before the MLP layer, which has its own layernorm. The updated residual matrix is normalized and passed to a feed-forward neural network, where the activation function is usually GELU or a newer variant of it. Inside the MLP, the matrix is multiplied by an input weight matrix, passed through the activation, multiplied by an output weight matrix, and then added to the residual stream again. That is everything that happens in one transformer block.
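The block described above can be sketched in plain Python. This is a simplified version under several assumptions, not a faithful copy of any real model: it uses a single attention head, a causal mask, no biases, no learned layernorm scale or shift, random weights, and made-up sizes. It uses plain lists to stay dependency-free; a real implementation would use a tensor library.

```python
import math
import random

random.seed(0)
d_model, d_head, d_mlp = 8, 4, 32  # hypothetical toy sizes


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def layernorm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance.
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def attention_head(x, W_Q, W_K, W_V, W_O):
    q, k, v = matmul(x, W_Q), matmul(x, W_K), matmul(x, W_V)
    scores = matmul(q, list(zip(*k)))  # query-key dot products
    pattern = []
    for i, row in enumerate(scores):
        # Causal mask: position i only attends to positions <= i.
        masked = [s / math.sqrt(d_head) if j <= i else float("-inf")
                  for j, s in enumerate(row)]
        pattern.append(softmax(masked))
    # Weighted sum of values, projected back to d_model.
    return matmul(matmul(pattern, v), W_O)


def mlp(x, W_in, W_out):
    hidden = [[gelu(v) for v in row] for row in matmul(x, W_in)]
    return matmul(hidden, W_out)


def transformer_block(resid, params):
    # Attention sub-layer: normalize, attend, add the delta back (skip connection).
    resid = add(resid, attention_head(layernorm(resid), *params["attn"]))
    # MLP sub-layer: its own layernorm, then the same additive pattern.
    resid = add(resid, mlp(layernorm(resid), *params["mlp"]))
    return resid


params = {
    "attn": [rand_matrix(d_model, d_head) for _ in range(3)]
            + [rand_matrix(d_head, d_model)],
    "mlp": [rand_matrix(d_model, d_mlp), rand_matrix(d_mlp, d_model)],
}
resid = rand_matrix(5, d_model)  # stand-in residual stream: 5 positions
out = transformer_block(resid, params)
print(len(out), len(out[0]))  # 5 8
```

The output has the same shape as the input, which is what lets many blocks be stacked one after another: each block only adds its own contribution on top of what was already in the residual stream.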

The model I worked through had 12 transformer blocks, but many modern AI models have more than 50. This is exciting. I may tinker with an open model to figure out how the transformer blocks contribute to AI memory. It's a fun project idea.

After leaving the last transformer block, the residual stream is unembedded into vocabulary logits, which are converted to probabilities. The token with the highest probability is picked as the next word, and it is then converted from a number back into a readable string.
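The last step can be sketched the same way. The vocabulary, sizes, and random weights below are assumptions for illustration, and the final layernorm that models usually apply before unembedding is left out to keep it short.

```python
import math
import random

random.seed(0)

vocab = ["the", "cat", "sat", "on", "mat"]  # hypothetical toy vocabulary
d_model = 4

# Unembedding matrix (d_model x vocab size); random stand-in for learned weights.
W_U = [[random.gauss(0.0, 1.0) for _ in vocab] for _ in range(d_model)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(final_vector):
    # One logit per vocabulary entry: final_vector times W_U.
    logits = [sum(x * w for x, w in zip(final_vector, col)) for col in zip(*W_U)]
    probs = softmax(logits)
    # Greedy choice: take the index with the highest probability.
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs


# Stand-in for the residual stream at the last position.
final_vector = [random.gauss(0.0, 1.0) for _ in range(d_model)]
token, probs = next_token(final_vector)
print(token, round(sum(probs), 6))
```

Always taking the top token is the simplest option. In practice, models often sample from the probabilities instead, which gives more varied output.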